AITopics | advantage estimator

Collaborating Authors

advantage estimator

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Loaded DiCE: Trading off Bias and Variance in Any-Order Score Function Gradient Estimators for Reinforcement Learning

Gregory Farquhar, Shimon Whiteson, Jakob Foerster

Neural Information Processing SystemsFeb-12-2026, 13:28:36 GMT

Neural Information Processing Systems http://nips.cc/

estimator, objective, variance, (15 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.53)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.47)

Add feedback

Think Only When You Need with Large Hybrid-Reasoning Models

Jiang, Lingjie, Wu, Xun, Huang, Shaohan, Dong, Qingxiu, Chi, Zewen, Dong, Li, Zhang, Xingxing, Lv, Tengchao, Cui, Lei, Wei, Furu

arXiv.org Artificial IntelligenceMay-22-2025

Recent Large Reasoning Models (LRMs) have shown substantially improved reasoning capabilities over traditional Large Language Models (LLMs) by incorporating extended thinking processes prior to producing final responses. However, excessively lengthy thinking introduces substantial overhead in terms of token consumption and latency, which is particularly unnecessary for simple queries. In this work, we introduce Large Hybrid-Reasoning Models (LHRMs), the first kind of model capable of adaptively determining whether to perform thinking based on the contextual information of user queries. To achieve this, we propose a two-stage training pipeline comprising Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with the proposed Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate thinking mode. Furthermore, we introduce a metric called Hybrid Accuracy to quantitatively assess the model's capability for hybrid thinking. Extensive experimental results show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type. It outperforms existing LRMs and LLMs in reasoning and general capabilities while significantly improving efficiency. Together, our work advocates for a reconsideration of the appropriate use of extended thinking processes and provides a solid starting point for building hybrid thinking systems.

arxiv preprint arxiv, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2505.14631

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Middle East > Iraq > Basra Governorate > Basra (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

Bai, Hao, Zhou, Yifei, Cemri, Mert, Pan, Jiayi, Suhr, Alane, Levine, Sergey, Kumar, Aviral

arXiv.org Artificial IntelligenceJun-14-2024

Training corpuses for vision language models (VLMs) typically lack sufficient amounts of decision-centric data. This renders off-the-shelf VLMs sub-optimal for decision-making tasks such as in-the-wild device control through graphical user interfaces (GUIs). While training with static demonstrations has shown some promise, we show that such methods fall short for controlling real GUIs due to their failure to deal with real-world stochasticity and non-stationarity not captured in static observational data. This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents through fine-tuning a pre-trained VLM in two stages: offline RL to initialize the model, followed by offline-to-online RL. To do this, we build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator and develop a simple yet effective RL approach for learning in this domain. Our approach runs advantage-weighted RL with advantage estimators enhanced to account for stochasticity along with an automatic curriculum for deriving maximal learning signal. We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild (AitW) dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement -- from 17.7 to 67.2% success rate -- over supervised fine-tuning with static human demonstration data. These results significantly surpass not only the prior best agents, including AppAgent with GPT-4V (8.3% success rate) and the 17B CogAgent trained with AitW data (38.5%), but also the prior best autonomous RL approach based on filtered behavior cloning (57.8%), thereby establishing a new state-of-the-art for digital agents for in-the-wild device control.

large language model, machine learning, reinforcement learning, (24 more...)

arXiv.org Artificial Intelligence

2406.11896

Country: North America > United States (0.46)

Genre: Research Report (0.82)

Industry:

Information Technology > Services (0.47)
Energy > Oil & Gas (0.46)
Education > Educational Setting (0.46)

Technology:

Information Technology > Human Computer Interaction > Interfaces (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(4 more...)

Add feedback

Soft policy optimization using dual-track advantage estimator

Huang, Yubo, Wang, Xuechun, Zou, Luobao, Zhuang, Zhiwei, Zhang, Weidong

arXiv.org Machine LearningSep-15-2020

In reinforcement learning (RL), we always expect the agent to explore as many states as possible in the initial stage of training and exploit the explored information in the subsequent stage to discover the most returnable trajectory. Based on this principle, in this paper, we soften the proximal policy optimization by introducing the entropy and dynamically setting the temperature coefficient to balance the opportunity of exploration and exploitation. While maximizing the expected reward, the agent will also seek other trajectories to avoid the local optimal policy. Nevertheless, the increase of randomness induced by entropy will reduce the train speed in the early stage. Integrating the temporal-difference (TD) method and the general advantage estimator (GAE), we propose the dual-track advantage estimator (DTAE) to accelerate the convergence of value functions and further enhance the performance of the algorithm. Compared with other on-policy RL algorithms on the Mujoco environment, the proposed method not only significantly speeds up the training but also achieves the most advanced results in cumulative return.

artificial intelligence, optimization, upstream oil & gas, (18 more...)

arXiv.org Machine Learning

2009.06858

Country: Asia > China (0.15)

Genre: Research Report (0.64)

Industry: Energy > Oil & Gas > Upstream (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Variance Reduced Advantage Estimation with $\delta$ Hindsight Credit Assignment

Young, Kenny

arXiv.org Artificial IntelligenceNov-19-2019

Hindsight Credit Assignment (HCA) refers to a recently proposed family of methods for producing more efficient credit assignment in reinforcement learning. These methods work by explicitly estimating the probability that certain actions were taken in the past given present information. Prior work has studied the properties of such methods and demonstrated their behaviour empirically. We extend this work by introducing a particular HCA algorithm which has provably lower variance than the conventional Monte-Carlo estimator when the necessary functions can be estimated exactly. This result provides a strong theoretical basis for how HCA could be broadly useful.

advantage estimator, estimator, variance, (14 more...)

arXiv.org Artificial Intelligence

1911.08362

Country:

Asia > Middle East > Jordan (0.04)
North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.88)

Add feedback

Loaded DiCE: Trading off Bias and Variance in Any-Order Score Function Estimators for Reinforcement Learning

Farquhar, Gregory, Whiteson, Shimon, Foerster, Jakob

arXiv.org Machine LearningSep-23-2019

Gradient-based methods for optimisation of objectives in stochastic settings with unknown or intractable dynamics require estimators of derivatives. We derive an objective that, under automatic differentiation, produces low-variance unbiased estimators of derivatives at any order. Our objective is compatible with arbitrary advantage estimators, which allows the control of the bias and variance of any-order derivatives when using function approximation. Furthermore, we propose a method to trade off bias and variance of higher order derivatives by discounting the impact of more distant causal dependencies. We demonstrate the correctness and utility of our objective in analytically tractable MDPs and in meta-reinforcement-learning for continuous control.

estimator, objective, variance, (15 more...)

arXiv.org Machine Learning

1909.10549

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
North America > Canada (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.73)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.47)

Add feedback

Augment-Reinforce-Merge Policy Gradient for Binary Stochastic Policy

Tang, Yunhao, Yin, Mingzhang, Zhou, Mingyuan

arXiv.org Artificial IntelligenceMar-12-2019

Due to the high variance of policy gradients, on-policy optimization algorithms are plagued with low sample efficiency. In this work, we propose Augment-Reinforce-Merge (ARM) policy gradient estimator as an unbiased low-variance alternative to previous baseline estimators on tasks with binary action space, inspired by the recent ARM gradient estimator for discrete random variable models. We show that the ARM policy gradient estimator achieves variance reduction with theoretical guarantees, and leads to significantly more stable and faster convergence of policies parameterized by neural networks.

artificial intelligence, estimator, machine learning, (13 more...)

arXiv.org Artificial Intelligence

1903.05284

Country: North America > United States > Texas > Travis County > Austin (0.04)

Genre: Workflow (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)

Add feedback

Expert-augmented actor-critic for ViZDoom and Montezumas Revenge

Garmulewicz, Michał, Michalewski, Henryk, Miłoś, Piotr

arXiv.org Machine LearningSep-10-2018

We propose an expert-augmented actor-critic algorithm, which we evaluate on two environments with sparse rewards: Montezumas Revenge and a demanding maze from the ViZDoom suite. In the case of Montezumas Revenge, an agent trained with our method achieves very good results consistently scoring above 27,000 points (in many experiments beating the first world). With an appropriate choice of hyperparameters, our algorithm surpasses the performance of the expert data. In a number of experiments, we have observed an unreported bug in Montezumas Revenge which allowed the agent to score more than 800,000 points.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

1809.03447

Country:

Europe > Poland > Masovia Province > Warsaw (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games > Computer Games (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.78)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Deep Reinforcement Learning with Online Generalized Advantage Estimation – Tom Breloff

#artificialintelligenceOct-6-2016, 17:37:56 GMT

Deep Reinforcement Learning, or Deep RL, is a really hot field at the moment. If you haven't heard of it, pay attention. Combining the power of reinforcement learning and deep learning, it is being used to play complex games better than humans, control driverless cars, optimize robotic decisions and limb trajectories, and much more. And we haven't even gotten started… Deep RL has far reaching applications in business, finance, health care, and many other fields which could be improved with better decision making. It's the closest (practical) approach we have to AGI.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback